Vanishing Gradients
If the activation function's gradient is less than one on average, the gradients shrink exponentially as the number of layers increases, because backpropagation multiplies one per-layer derivative for every layer. The signal reaching the early layers becomes vanishingly small, so the network stops learning.
Sigmoid's derivative is at most 0.25 (at x = 0), so it's very prone to this problem. Tanh's derivative is at most 1 (also at x = 0) and drops below 1 everywhere else, so it also vanishes the gradient, just less severely.
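A minimal numerical sketch of this (plain NumPy, no particular framework assumed): multiply the per-layer derivative of sigmoid and tanh by itself once per layer and watch the gradient magnitude collapse with depth.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # max value 0.25, reached at x = 0

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2  # max value 1.0, reached at x = 0

x = 0.5  # a typical pre-activation value, chosen for illustration
for depth in (5, 20, 50):
    g_sig = sigmoid_grad(x) ** depth   # product of per-layer derivatives
    g_tanh = tanh_grad(x) ** depth
    print(f"depth={depth:3d}  sigmoid grad ~ {g_sig:.2e}  tanh grad ~ {g_tanh:.2e}")
```

Even at depth 20 the sigmoid chain is already around 10^-13, while tanh survives longer but still decays.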
It's solved by:
- ReLU, which keeps the derivative at exactly 1 for all positive inputs, so the gradient passes through unchanged instead of shrinking.
- Residual connections, which keep a skip path y = x + F(x); its derivative is 1 + F'(x), so the identity term always contributes 1 even when the block's gradient is near 0 (see the sketch after this list).
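A hedged sketch of the residual-connection effect (assumes PyTorch; the layer sizes and depth are arbitrary illustration values): build the same deep sigmoid stack twice, once plain and once with a skip connection around each layer, and compare the gradient norm that reaches the input.

```python
import torch
import torch.nn as nn

depth, width = 30, 16
torch.manual_seed(0)

layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))

def grad_norm_at_input(residual: bool) -> float:
    x = torch.randn(1, width, requires_grad=True)
    h = x
    for layer in layers:
        out = torch.sigmoid(layer(h))
        h = h + out if residual else out  # residual path adds the identity term
    h.sum().backward()
    return x.grad.norm().item()

print("plain   :", grad_norm_at_input(residual=False))
print("residual:", grad_norm_at_input(residual=True))
```

In the plain stack the gradient at the input is nearly zero after 30 sigmoid layers; with the skip connections it stays at a usable magnitude, because every layer's Jacobian includes the identity.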